-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vcf int64 part2 #1004
Vcf int64 part2 #1004
Conversation
Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second part (BCF output silently being broken). Fixes samtools/bcftools#1123
You should also check To reduce concerns about change in behaviour due to 64 bit INFO tags, it should be fairly easy to restrict them to just the |
This goes in the right direction, but should build on top of #1003. I agree with forbidding writing out BCFs with 64-bit values. But the problem remains: one out-of-range value will turn on 64-bit mode and will make it impossible to write out an otherwise valid BCF. It should not be compiled in by default, unless you know you are going to work with large genomes. There are terabytes of invalid VCF/BCFs sitting on sanger's drives, but I have not seen a single 64-bit genome yet. It should be the rare case that gets inconvenienced, not the major use case. |
If the problem really is that endemic, that we have thousands of invalid files and that we must be able to transform these invalid files in / out of BCF silently (with corresponding broken data due to lack of support), then for this immediate problem we should simply disable the 64-bit VCF code so we can take time to do things "right" for the next release (where "right" probably means issue a warning and transform the values to 'missing'). However that's also not your PR which amended the BCF format. Do you have a simple PR that does this? We don't have time to do something complex as it will require a huge amount of testing which simply won't get done, at least not by me, due to insufficient time this year. |
Hm, I did consider the PR simple. Anything outside 64bit branch is just turning large values into missing ones and printing a warning. I have tests in bcftools ready, which are waiting for the change in htslib. |
Also adds check for rlen (END) field when trying to output to BCF. Note: this will emit a warning per out-of-range value, but it is expected that the frequency of these is very low. If not, consider this a good icentive to fix your pipelines!
@pd3 how about this latest commit? It is still simple. It's only 20-odd lines of new code which is about 1/3rd the size of your PR and contains fewer problematic things (like mixing long with int64_t, strtol vs strtoll, and creation of non-compliant BCF files) while offering better diagnostics and reporting to the user. It takes @daviesrob's idea of only permitting the internal representation on the pos field and the END info tag. All other out of range info tags now get converted to "missing" and a warning is issued. It also tracks when the VCF is not representable in BCF and refuses to write one out. That can only possibly happen when BCF itself cannot be used, so it is not a regression in any shape or form. |
Superseded by #1003 |
NOTE: this is branched off my vcf_int64 PR #1000 so it contains that commit also.
This adds extra checks to vcf parsing to set a flag to indicate the record contains 64-bit data which can only be written back to VCF, and not to BCF.
It's probably not complete (I don't check FORMAT data, only INFO and POS), but I'm not sufficiently familier with the vcf parser and format to know where 64-bit data can occur. I think it's just POS and INFO (for END)?
Anyway, this is a proof of principle, hence in its own separate PR so we can merge the other one if needed without too many shenanigans.